Background
Python stands at the forefront of data science and machine learning, not just because of its simplicity but also because of its powerful suite of functions and methods. We will explore some of the most useful functions, including map(), filter(), and zip(), along with the elegance of list and dictionary comprehensions, the power of iterators and generators, and the versatility of decorators. We’ll delve into their syntax, provide simple examples, and demonstrate their practical application in data science and machine learning to harness Python’s capabilities for data-driven projects.
You can run the code below in a Jupyter Notebook via Binder here
map() & filter()
The map() function takes two inputs: the function to apply and the iterable (e.g. list, set, dictionary, tuple, string) to apply it to. It returns an iterator, in this case a map object, that produces the transformed values. The input function can be any callable, including built-in functions, lambda functions, user-defined functions, classes and methods. Essentially, map() iterates over the iterable and applies the input function to each element, just like a for loop might do.
syntax: map(function, iterable)
## Simple example
# Define a function to calculate the square of a number
def square(x):
    return x ** 2

# Create a list of integers
numbers = [1, 2, 3, 4, 5]

# Apply the square function to each element and convert the resulting map object to a list
squared_numbers = list(map(square, numbers))
squared_numbers  # Output: [1, 4, 9, 16, 25]
Objects returned by map() and filter() are iterators, which means their values are not stored but generated as needed.
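To make the lazy behaviour concrete, here is a minimal sketch (the variable names are illustrative, not from the original notebook). It shows that a map object computes values only when asked and can be consumed only once.

```python
numbers = [1, 2, 3]
squared = map(lambda x: x ** 2, numbers)  # nothing is computed yet

first = next(squared)        # computes only the first value
remaining = list(squared)    # consumes the rest of the iterator
second_pass = list(squared)  # the iterator is now exhausted

print(first)        # 1
print(remaining)    # [4, 9]
print(second_pass)  # []
```

This is why the examples in this article wrap map() in list() whenever the results need to be reused.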
In this example, we demonstrate the use of map() in a typical data processing scenario common in machine learning. The goal here is to normalise a list of height data. Normalisation, especially Z-score normalisation, is a crucial step in preparing data for many machine-learning algorithms. It involves subtracting the mean and dividing by the standard deviation, so that the data has a mean of 0 and a standard deviation of 1. This puts features on a common scale, so that no feature dominates the model simply because of its units.
## Example in Data Preprocessing
heights = [170, 155, 180, 190, 162]

# Function to calculate the mean and (population) standard deviation
def calculate_mean_std(data):
    mean = sum(data) / len(data)
    std = (sum([(x - mean) ** 2 for x in data]) / len(data)) ** 0.5
    return mean, std

mean_height, std_height = calculate_mean_std(heights)

# Define a function to normalise the data
def normalize(x):
    return (x - mean_height) / std_height

normalized_heights = list(map(normalize, heights))
normalized_heights  # Output (truncated to 3 decimals): [-0.112, -1.313, 0.688, 1.489, -0.752]
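As a cross-check, the same normalisation can be sketched with the standard library's statistics module. This is an alternative to the hand-written helper above, not the original author's method. Note that the helper divides by len(data), which corresponds to statistics.pstdev (population standard deviation); statistics.stdev (sample standard deviation) would give slightly different values.

```python
import statistics

heights = [170, 155, 180, 190, 162]

mean_height = statistics.fmean(heights)  # arithmetic mean as a float
std_height = statistics.pstdev(heights)  # population standard deviation (divides by n)

normalized = [(h - mean_height) / std_height for h in heights]

# After Z-score normalisation the mean should be close to 0 and the std close to 1
print(round(statistics.fmean(normalized), 6))
print(round(statistics.pstdev(normalized), 6))
```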
filter() is a function that selectively passes elements from an iterable through a test function. Like map(), it takes a function and an iterable as arguments. However, instead of transforming each element, filter() checks each element against the test function and returns only those for which the function evaluates to True. In simple terms, filter() sifts through an iterable, keeping only the elements that meet a specified condition.
syntax: filter(function, iterable)
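A minimal sketch of the syntax in action, using an illustrative is_even helper that is not part of the original text. It also shows that passing None as the function keeps only the truthy elements.

```python
numbers = [0, 1, 2, 3, 4, 5, 6]

def is_even(x):
    return x % 2 == 0

evens = list(filter(is_even, numbers))
truthy = list(filter(None, numbers))  # None as the function keeps truthy elements

print(evens)   # [0, 2, 4, 6]
print(truthy)  # [1, 2, 3, 4, 5, 6]
```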
In a machine learning context, we can use filter() for feature selection, data cleaning or removing outliers. For instance, we might want to filter out abnormal values, which could be due to sensor errors or genuine outliers. Such readings can distort the training of an ML model, so removing them is a common preprocessing step.
# Sample data: temperature readings in Fahrenheit
sensor_readings = [102, 95, 101, 150, 99, 98, 200, 105]

# Function to convert Fahrenheit readings to Celsius
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Function to check if the temperature is within the normal range
def is_normal_temp(c):
    return -5 <= c <= 40

# First, convert all readings from Fahrenheit to Celsius using map
celsius_readings = list(map(fahrenheit_to_celsius, sensor_readings))
celsius_readings  # Output (truncated to 1 decimal): [38.8, 35.0, 38.3, 65.5, 37.2, 36.6, 93.3, 40.5]

# Next, filter out abnormal readings using filter
normal_celsius_readings = list(filter(is_normal_temp, celsius_readings))
normal_celsius_readings  # Output (truncated to 1 decimal): [38.8, 35.0, 38.3, 37.2, 36.6]
Overall, map() and filter() provide a concise and, because they are lazy, memory-efficient way of handling data transformations and filtering. They are often considered a more elegant alternative to explicit loops, aligning with Python’s emphasis on readability and simplicity.
Note: “Pythonic” refers to a coding style and practice that leverages Python’s unique features to write code that is clear, concise and readable. For comprehensive guidelines on adhering to these conventions, refer to the PEP 8 Style Guide.
lambda functions
lambda functions, also known as anonymous functions, are defined using the lambda keyword and are especially useful as one-off functions that don’t need to be named or reused. They are typically used where a function is required for a short period within a larger expression, and are best used sparingly, in situations where they enhance readability and conciseness.
syntax: lambda parameters: expression
Previously, we used both map() and filter() in conjunction with user-defined functions. However, we can achieve the same result using lambda. This approach is more concise, eliminating the need for a separate function definition.
# Method 1 using def
# Define a function to calculate the square of a number
def square(x):
    return x ** 2

# Create a list of integers
numbers = [1, 2, 3, 4, 5]

# Apply the square function to each element and convert the resulting map object to a list
squared_numbers_1 = list(map(square, numbers))

# Method 2 using lambda
squared_numbers_2 = list(map(lambda x: x ** 2, numbers))
In the example below, lambda performs the same check as even() within the filter() call. This approach is more streamlined and avoids having to define and name a separate function.
# Method 1 using def
def even(x):
    return x % 2 == 0

numbers = [1, 2, 3, 4, 5]

# Keep only the elements for which even() returns True and convert the resulting filter object to a list
filtered_numbers_1 = list(filter(even, numbers))

# Method 2 using lambda
filtered_numbers_2 = list(filter(lambda x: x % 2 == 0, numbers))
Although lambda can make code more concise and expressive, particularly in common programming patterns involving map(), filter() and sorted(), lambda functions are not a one-size-fits-all solution and should be used judiciously to maintain code clarity. Here’s an example of when the use of lambda would not be appropriate:
# Using def
# (complex_transformation, condition, aggregate and post_process are placeholders for your own functions)
def process_data(data):
    # Check if data is empty or None, raise an error if True
    if not data:
        raise ValueError("No data provided")

    # Apply a complex data transformation to each data item if a condition is met
    transformed = [complex_transformation(d) for d in data if condition(d)]

    # Aggregate transformed data
    result = aggregate(transformed)

    # Post-process aggregated result
    post_processed_result = post_process(result)

    return post_processed_result

# Using lambda
# Note: a lambda cannot contain a raise statement, so this version returns the ValueError instead of raising it
process_data_lambda = lambda data: post_process(aggregate([complex_transformation(d) for d in data if condition(d)])) if data else ValueError("No data provided")
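By contrast, a case where lambda does tend to help readability is a short key function for sorted(), which the text mentions but does not show. The sketch below uses made-up (model name, accuracy) pairs purely for illustration.

```python
# Hypothetical (model name, validation accuracy) pairs, invented for illustration
results = [("tree", 0.81), ("svm", 0.86), ("knn", 0.78)]

# Sort by the second element of each pair, highest accuracy first
by_accuracy = sorted(results, key=lambda pair: pair[1], reverse=True)

# The same key function works with max() to pick the best entry
best_name = max(results, key=lambda pair: pair[1])[0]

print(by_accuracy)  # [('svm', 0.86), ('tree', 0.81), ('knn', 0.78)]
print(best_name)    # svm
```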
List & Dictionary Comprehensions
List and dictionary comprehensions are concise ways to create lists and dictionaries in Python. They offer a more readable and efficient way to generate these collections compared to traditional loops and functions. Comprehensions follow the principle of writing simple expressions that transform or filter sequence elements.
syntax: new_list = [expression for item in iterable if condition]
syntax: new_dict = {key: value for key, value in iterable if condition}
The expression can be any function or operation applied to the item. The condition is optional in both list and dictionary comprehensions.
In some simple cases, using a list comprehension with a conditional can be thought of as using map() and filter() in conjunction. In the example below, we square all even elements in a specified range.
# Simple example
squares = [x**2 for x in range(5) if x % 2 == 0]
squares  # Output: [0, 4, 16]
The equivalent approach using map() and filter() looks like this. The list comprehension is clearly more readable and succinct, making it the preferred choice in such scenarios.
squares = map(lambda x: x**2, filter(lambda x: x % 2 == 0, range(5)))
squares = list(squares)
squares  # Output: [0, 4, 16]
For dictionary comprehensions, the syntax is similar, except that we use curly braces {} instead of square brackets [] and provide a key: value pair as the expression. In the example below, we omit the optional conditional.
# Simple example
# Dictionary inversion, keys become values and values become keys
my_map = {"a": 1, "b": 2, "c": 3}

inverted_map = {value: key for key, value in my_map.items()}
inverted_map  # Output: {1: 'a', 2: 'b', 3: 'c'}
In many machine learning models, especially those based on algorithms that require numerical input, you need to convert categorical data (like colour names) into numerical form. One simple approach is to create a binary encoding for a categorical feature, which can be simply implemented with a list or dict comprehension.
# ML related example
colours = ['red', 'green', 'blue', 'yellow', 'red']

encoded_colours = [1 if colour == 'red' else 0 for colour in colours]
encoded_colours  # Output: [1, 0, 0, 0, 1]
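The binary encoding above is effectively a single column of a one-hot encoding. The list comprehension keeps it readable and is typically somewhat faster than an explicit loop, which helps when encoding the large datasets typical in machine learning tasks. A dictionary comprehension can similarly map each category to an integer index, as in the next example.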
categories = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']

category_encoding = {category: idx for idx, category in enumerate(set(categories))}
category_encoding  # Example output (order may vary, since sets are unordered): {'orange': 0, 'banana': 1, 'pear': 2, 'apple': 3}
In summary, list and dictionary comprehensions are not only about writing more concise code; they can also offer performance benefits. In CPython, a list comprehension appends to the result with a dedicated bytecode instruction (LIST_APPEND), whereas an equivalent for loop has to look up and call the list's append method on every iteration. The speed-up is usually modest and varies with the Python version and the workload, so it is worth measuring rather than assuming.
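The performance claim is easy to check on your own machine. The following is a minimal sketch using the standard timeit module; the function names are illustrative and the timings will differ between Python versions and hardware, so no expected numbers are shown.

```python
import timeit

def with_loop(n):
    result = []
    for x in range(n):
        if x % 2 == 0:
            result.append(x ** 2)
    return result

def with_comprehension(n):
    return [x ** 2 for x in range(n) if x % 2 == 0]

# Both approaches must produce the same list
assert with_loop(1000) == with_comprehension(1000)

# Time 1000 calls of each version
loop_time = timeit.timeit(lambda: with_loop(1000), number=1000)
comp_time = timeit.timeit(lambda: with_comprehension(1000), number=1000)
print(f"for loop:      {loop_time:.4f} s")
print(f"comprehension: {comp_time:.4f} s")
```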
zip()
The zip() function accepts iterables as its arguments and returns a zip object (an iterator of tuples). The first items of the inputs are paired together, then the second items, and so on, until the shortest input is exhausted; the shortest input therefore determines the length of the output. In other words, the i-th tuple contains the i-th element from each of the input iterables.
syntax: zip(iterable_1, iterable_2, iterable_3)
# Simple example pairing fruits with their colours
fruits = ['apple', 'banana', 'grape', 'pear']
colours = ['red', 'yellow', 'purple']

paired_fruits = zip(fruits, colours)

list(paired_fruits)
# Output: [('apple', 'red'), ('banana', 'yellow'), ('grape', 'purple')]
There’s no restriction on the number of iterables you can provide to Python’s zip() function as input arguments.
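In the fruit example, 'pear' was silently dropped because the colours list is shorter. If that is not what you want, one option (a sketch, not something the original example uses) is itertools.zip_longest, which pads the shorter input. The same snippet also shows how zip(*pairs) reverses the pairing.

```python
from itertools import zip_longest

fruits = ['apple', 'banana', 'grape', 'pear']
colours = ['red', 'yellow', 'purple']

# zip() stops at the shortest input; zip_longest() pads the shorter one instead
padded = list(zip_longest(fruits, colours, fillvalue='unknown'))
print(padded)
# [('apple', 'red'), ('banana', 'yellow'), ('grape', 'purple'), ('pear', 'unknown')]

# zip(*pairs) reverses the pairing ("unzipping")
names, shades = zip(*padded)
print(names)   # ('apple', 'banana', 'grape', 'pear')
print(shades)  # ('red', 'yellow', 'purple', 'unknown')
```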
Suppose you are working on a machine learning problem where you need to combine features from different datasets or feature sets. For instance, you might have one set of features coming from survey data and another set from transactional data, and you need to pair these features for each individual in your dataset. This is where the zip() function comes into use.
survey_features = [(25, 'M'), (30, 'F'), (22, 'F')]  # Example: (Age, Gender)
transactional_features = [(1200, 300), (2000, 500), (1500, 400)]  # Example: (Annual Spending, Number of Transactions)

# Combine features using zip
combined_features = zip(survey_features, transactional_features)

for survey, transaction in combined_features:
    print(f"Survey Data: {survey}, Transactional Data: {transaction}")
# Output: Survey Data: (25, 'M'), Transactional Data: (1200, 300)
#         Survey Data: (30, 'F'), Transactional Data: (2000, 500)
#         Survey Data: (22, 'F'), Transactional Data: (1500, 400)
In this example, zip() is used to combine data from two different feature sets, creating a paired iterable. For each individual, you now have a tuple containing both their survey data and transactional data. This can be particularly useful in feature engineering, where you might need to create a unified feature set from different sources.
Iterators & Generators
As previously discussed, functions like map() and filter() return iterator objects. Iterators are fundamental to Python’s handling of sequences and collections. An iterator is an object that produces its values one at a time, so you can traverse all the values it holds. This is facilitated by the __iter__() method, which returns the iterator object itself, and the __next__() method, which retrieves the next value from the iterator and raises a StopIteration exception when there are no more values.
# Simple example
class CountDown():
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        else:
            self.current -= 1
            return self.current

for number in CountDown(3):
    print(number)
# Output: 2 1 0
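To see how the protocol is used, here is a rough sketch of what a for loop does behind the scenes, written out by hand with the built-in iter() and next() functions on a plain list. The variable names are illustrative.

```python
values = [10, 20, 30]

iterator = iter(values)        # calls values.__iter__()
collected = []
while True:
    try:
        item = next(iterator)  # calls iterator.__next__()
    except StopIteration:
        break                  # a for loop ends silently at this point
    collected.append(item)

print(collected)  # [10, 20, 30]
```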
Iterators are particularly useful in handling large data streams that might not fit entirely in memory.
# ML related example
class BatchIterator():
    """
    An iterator that yields batches of data from a dataset.
    This is useful in machine learning for processing large datasets in mini-batches.
    """
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index >= len(self.dataset):
            raise StopIteration
        batch = self.dataset[self.index:self.index + self.batch_size]
        self.index += self.batch_size
        return batch
Generators are a more concise way to create iterators using functions. A generator function contains the yield keyword; calling it returns a generator object, which is an iterator that produces a sequence of values. Unlike regular functions, which return a single value and then terminate, generators pause at each yield and keep their local variables between values, making them ideal for efficient data processing. Here are some examples:
# Simple example
def countdown_generator(start):
    current = start
    while current > 0:
        yield current
        current -= 1

for number in countdown_generator(3):
    print(number)
# Output: 3 2 1
# ML related example
def data_augmentation_generator(dataset, augment_func):
    """
    A generator for performing data augmentation on the fly.
    This is useful in machine learning for augmenting data during training without
    needing to store the augmented data in memory.
    """
    for data in dataset:
        yield augment_func(data)
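The memory argument can be illustrated with a generator expression, which is the inline form of a generator. This is a rough sketch: sys.getsizeof reports only the size of the container object itself (not the integers it refers to), and the exact numbers depend on the Python build, so none are shown.

```python
import sys

n = 100_000
squares_list = [x ** 2 for x in range(n)]  # all values stored at once
squares_gen = (x ** 2 for x in range(n))   # values produced on demand

print(sys.getsizeof(squares_list))  # grows with n
print(sys.getsizeof(squares_gen))   # small and independent of n

total = sum(squares_gen)            # consumes the generator one value at a time
print(total == sum(squares_list))   # True
```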
Overall
Iterators are objects that support the iterator protocol (__iter__ and __next__ methods), while generators are functions that yield values using yield.
Creation: Iterators are created using classes, requiring explicit definitions of __iter__ and __next__, whereas generators are created using functions, which is generally more concise.
Use Case: Iterators are ideal for more complex or stateful iteration, while generators are suitable for simpler cases where a function can manage the state.
Commonality: Both iterators and generators provide a way to handle large datasets efficiently, enabling the processing of data that doesn’t fit in memory, and both follow lazy evaluation, computing one element at a time on demand.
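To illustrate the conciseness point, the mini-batch logic of the BatchIterator class can be sketched as a generator function in a few lines. The name batch_generator and the toy dataset are assumptions made for this illustration.

```python
def batch_generator(dataset, batch_size):
    """Yield consecutive slices of dataset, each with at most batch_size items."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

samples = list(range(7))  # stand-in for a real dataset
batches = list(batch_generator(samples, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The loop variable takes over the role of self.index, and StopIteration is raised automatically when the function body finishes.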
Decorators
A decorator takes an existing function as an argument and extends its behaviour without explicitly modifying it. Decorators are a powerful feature that allows for the modification or enhancement of functions or methods in a clean, readable and maintainable way.
Uses
Code Reusability and Separation of Concerns: Decorators can add functionality to existing functions or methods, allowing for code reuse and separation of concerns.
Logging and Debugging: They are widely used for logging function calls, which is helpful for debugging.
Performance Testing: Decorators can be used for timing functions, which is crucial in performance testing and optimisation.
Data Processing and Pipeline Setup: In data science, decorators can streamline the setup of data processing pipelines, making the code cleaner and more modular.
# Simple example
def my_decorator(func):
    def wrapper():
        print('Something is happening before the function is called.')
        func()
        print('Something is happening after the function is called.')
    return wrapper

@my_decorator
def say_hello():
    print('Hello!')

say_hello()
# Output:
# Something is happening before the function is called.
# Hello!
# Something is happening after the function is called.
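The timing use case listed above can be sketched as follows. This is one possible implementation rather than a canonical one: the names timed and normalise are illustrative, the wrapper accepts *args and **kwargs so it works with any signature, and functools.wraps preserves the decorated function's name and docstring.

```python
import functools
import time

def timed(func):
    """Report how long each call to func takes."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} s")
        return result  # pass the original return value through
    return wrapper

@timed
def normalise(values):
    """Scale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

scaled = normalise([170, 155, 180, 190, 162])
print(scaled[1], scaled[3])  # 0.0 1.0
print(normalise.__name__)    # normalise
```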
In machine learning and data science, decorators have practical applications in performance monitoring, logging, and setting up data processing pipelines. By understanding and utilising decorators, you can write more efficient and Pythonic code.